Training course: Plotting Data for Communication and Exploration

Dianne Cook
Monash University
Produced for e61, September 23, 2024

Session 1: Creating communication graphics (mostly)


Timing  Topic
15      Organising your data for efficient plot descriptions
15      Grammatical descriptions for plots
30      Cognitive perception principles
15      Polishing your plots
30      Adding interactivity

Organising your data

Tidy format

What are the variables? WHO Tuberculosis Notifications

Rows: 16
Columns: 22
$ year         <dbl> 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012
$ new_sp       <dbl> 226, 203, 285, 251, 228, 210, 113, 285, 241, 269, 281, 299, 267, 274, 301, 290
$ new_sp_m04   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 0, NA, 0, 0, 0, 2
$ new_sp_m514  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 3, NA, 3, 2, 2, 1
$ new_sp_m014  <dbl> 1, 0, 0, 3, 1, 1, 0, 0, 0, 1, 3, 2, 3, 2, 2, 3
$ new_sp_m1524 <dbl> 8, 11, 13, 16, 23, 15, 14, 18, 32, 33, 30, 46, 30, 42, 38, 26
$ new_sp_m2534 <dbl> 24, 22, 40, 35, 20, 20, 10, 16, 27, 35, 33, 33, 37, 33, 44, 40
$ new_sp_m3544 <dbl> 18, 18, 54, 25, 18, 26, 2, 17, 23, 23, 20, 20, 16, 22, 26, 17
$ new_sp_m4554 <dbl> 13, 13, 52, 24, 18, 19, 11, 15, 11, 21, 15, 27, 24, 25, 19, 25
$ new_sp_m5564 <dbl> 17, 15, 37, 19, 13, 13, 5, 11, 12, 16, 14, 23, 12, 9, 12, 16
$ new_sp_m65   <dbl> 28, 31, 49, 49, 35, 34, 30, 32, 30, 43, 37, 42, 34, 27, 37, 37
$ new_sp_mu    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0
$ new_sp_f04   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 0, NA, 1, 1, 2, 0
$ new_sp_f514  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 4, NA, 3, 3, 1, 1
$ new_sp_f014  <dbl> 0, 2, 0, 0, 1, 0, 0, 0, 2, 2, 4, 3, 4, 4, 3, 1
$ new_sp_f1524 <dbl> 10, 19, 10, 15, 21, 15, 9, 6, 18, 18, 26, 27, 31, 36, 26, 27
$ new_sp_f2534 <dbl> 15, 24, 16, 19, 27, 21, 13, 17, 26, 27, 37, 32, 27, 43, 40, 48
$ new_sp_f3544 <dbl> 9, 15, 18, 12, 16, 15, 3, 5, 11, 14, 20, 14, 14, 12, 23, 15
$ new_sp_f4554 <dbl> 5, 8, 6, 15, 7, 6, 5, 7, 10, 7, 12, 6, 12, 2, 7, 11
$ new_sp_f5564 <dbl> 10, 2, 2, 5, 8, 4, 4, 3, 6, 9, 7, 11, 11, 5, 7, 9
$ new_sp_f65   <dbl> 12, 24, 26, 14, 20, 23, 7, 19, 14, 21, 23, 10, 12, 12, 17, 15
$ new_sp_fu    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0

variables are:
# year, sex, age category
# R (tidyverse)
library(tidyverse)
tb <- read_csv("data/TB_notifications_2023-08-21.csv") |>
  filter(country == "Australia", year > 1996, year < 2013) |>
  select(year, contains("new_sp"))
glimpse(tb)
# R (data.table)
library(data.table)
tb_dt <- fread("data/TB_notifications_2023-08-21.csv")
# Filter the rows, then select the columns by position
tb_dt <- tb_dt[country == "Australia" & year > 1996 & year < 2013,
  c(6, 26:44)]
* Import the CSV file
import delimited "data/TB_notifications_2023-08-21.csv", clear

* Filter for Australia and the years 1997 to 2012
keep if country == "Australia" & year > 1996 & year < 2013

* Keep only the year and variables containing "new_sp"
keep year *new_sp*

* Display the structure of the data
describe

* Show the first few observations
list in 1/10  

Tidy data

Illustrations from Julia Lowndes and Allison Horst

  • Each variable is a column; each column is a variable.

  • Each observation is a row; each row is an observation.

  • Each value is a cell; each cell is a single value.

  • Each table contains one data set.

  • Long form makes it easier to reshape in many different ways

  • Wider forms are common for analysis

Long form: one measured value per row. All other variables are descriptors (key variables).

Widest form: all measured values for an entity are in a single row.
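
The two forms can be sketched on a toy table, assuming the tidyr package. The data frame `wide` and its values are made up for illustration and are not the TB data.

```r
library(tidyr)

# Hypothetical toy data: one row per year, one column per sex (widest form)
wide <- data.frame(
  year = c(1997, 1998),
  m = c(10, 12),
  f = c(7, 9)
)

# Long form: one measured value (count) per row; year and sex are the keys
long <- pivot_longer(wide, cols = c(m, f),
                     names_to = "sex", values_to = "count")

# Back to the widest form: all measured values for a year in a single row
wide_again <- pivot_wider(long, names_from = sex, values_from = count)
```

Going from long to wide and back loses nothing here, which is why the long form is a convenient starting point for reshaping.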

Tidy data

Steps to wrangle to tidy form:

  1. Select only the variables containing sex and age counts
  2. Pivot into long form
  3. Extract variables from names (sexage column)
  4. Tidy age codes

Is count a variable?

# A tibble: 12 × 4
    year sex   age   count
   <dbl> <chr> <fct> <dbl>
 1  1997 m     0-14      1
 2  1997 m     15-24     8
 3  1997 m     25-34    24
 4  1997 m     35-44    18
 5  1997 m     45-54    13
 6  1997 m     55-64    17
 7  1997 m     > 65     28
 8  1997 f     0-14      0
 9  1997 f     15-24    10
10  1997 f     25-34    15
11  1997 f     35-44     9
12  1997 f     45-54     5
# R (tidyverse)
tb_tidy <- tb |>
  # Drop the total and the finer child age groups, which overlap with 0-14
  select(-new_sp, -new_sp_m04, -new_sp_m514, 
                  -new_sp_f04, -new_sp_f514) |> 
  pivot_longer(starts_with("new_sp"), 
    names_to = "sexage", 
    values_to = "count") |>
  mutate(sexage = str_remove(sexage, "new_sp_")) |>
  # First character is sex, the rest is the age code
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"
  ) |>
  filter(age != "u") |>
  mutate(age = fct_recode(age, "0-14" = "014",
                          "15-24" = "1524",
                          "25-34" = "2534",
                          "35-44" = "3544",
                          "45-54" = "4554",
                          "55-64" = "5564",
                          "> 65" = "65"))
tb_tidy |> slice_head(n = 12)
# R (data.table): first step only, melting into long form
tb_dt_tidy <- tb_dt |>
  melt(id.vars = "year")
* Drop specified variables
drop new_sp new_sp_m04 new_sp_m514 new_sp_f04 new_sp_f514

* Reshape data from wide to long format
reshape long new_sp_, i(year) j(sexage) string

* Rename reshaped variable
rename new_sp_ count

* Remove "new_sp_" prefix from sexage
replace sexage = subinstr(sexage, "new_sp_", "", .)

* Separate sexage into sex and age
gen sex = substr(sexage, 1, 1)
gen age = substr(sexage, 2, .)

* Drop original sexage variable
drop sexage

* Recode age variable
replace age = "0-14" if age == "014"
replace age = "15-24" if age == "1524"
replace age = "25-34" if age == "2534"
replace age = "35-44" if age == "3544"
replace age = "45-54" if age == "4554"
replace age = "55-64" if age == "5564"
replace age = "> 65" if age == "65"
replace age = "unknown" if age == "u"

* Convert age to a labeled factor variable
encode age, gen(age_factor)

* List the first few observations to check the result
list in 1/10

Challenge 1

Data on World Development Indicators (WDI) from the World Bank.


Rows: 4,793
Columns: 23
$ `Country Name`  <chr> "Afghanistan", "Afghanistan",…
$ `Country Code`  <chr> "AFG", "AFG", "AFG", "AFG", "…
$ `Series Name`   <chr> "Access to clean fuels and te…
$ `Series Code`   <chr> "EG.CFT.ACCS.ZS", "EG.CFT.ACC…
$ `2004 [YR2004]` <chr> "10.5", "1.9", "45.3", "NA", …
$ `2005 [YR2005]` <chr> "11.9", "2.4", "50.2", "NA", …
$ `2006 [YR2006]` <chr> "13.5", "3", "54.7", "NA", "1…
$ `2007 [YR2007]` <chr> "15.1", "3.6", "59.2", "NA", …
$ `2008 [YR2008]` <chr> "16.6", "4.3", "62.9", "NA", …
$ `2009 [YR2009]` <chr> "18.3", "5.1", "66.4", "NA", …
$ `2010 [YR2010]` <chr> "19.9", "5.9", "69.4", "NA", …
$ `2011 [YR2011]` <chr> "21.3", "7", "72", "NA", "2.5…
$ `2012 [YR2012]` <chr> "22.9", "8", "74.3", "NA", "2…
$ `2013 [YR2013]` <chr> "24.5", "9", "76.1", "NA", "3…
$ `2014 [YR2014]` <chr> "26.1", "10.2", "78", "NA", "…
$ `2015 [YR2015]` <chr> "27.6", "11.4", "79.5", "NA",…
$ `2016 [YR2016]` <chr> "28.8", "12.6", "80.5", "NA",…
$ `2017 [YR2017]` <chr> "30.3", "13.5", "81.6", "NA",…
$ `2018 [YR2018]` <chr> "31.4", "14.5", "82.6", "NA",…
$ `2019 [YR2019]` <chr> "32.6", "15.6", "83.2", "NA",…
$ `2020 [YR2020]` <chr> "33.8", "16.4", "83.8", "NA",…
$ `2021 [YR2021]` <chr> "34.9", "17.4", "84.5", "NA",…
$ `2022 [YR2022]` <chr> "36.1", "18.5", "85", "NA", "…
  • What are the variables?
  • What are the steps needed to wrangle it into tidy form?

Challenge 2

Melbourne weather data from NOAA.


Rows: 1,593
Columns: 128
$ V1   <chr> "ASN00086282", "ASN00086282", "ASN000862…
$ V2   <int> 1970, 1970, 1970, 1970, 1970, 1970, 1970…
$ V3   <int> 7, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 1…
$ V4   <chr> "TMAX", "TMIN", "PRCP", "TMAX", "TMIN", …
$ V5   <int> 141, 80, 3, 145, 50, 0, 168, 19, 0, 189,…
$ V6   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V7   <chr> " ", " ", " ", " ", " ", " ", " ", " ", …
$ V8   <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V9   <int> 124, 63, 30, 128, 61, 66, 168, 29, 0, 19…
$ V10  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V11  <chr> " ", " ", " ", " ", " ", " ", " ", " ", …
$ V12  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V13  <int> 113, 36, 0, 150, 75, 0, 162, 62, 0, 204,…
$ V14  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V15  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V16  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V17  <int> 123, 57, 0, 122, 67, 53, 162, 81, 0, 267…
$ V18  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V19  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V20  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V21  <int> 148, 69, 36, 109, 41, 13, 162, 81, 3, 25…
$ V22  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V23  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V24  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V25  <int> 149, 47, 3, 112, 51, 3, 150, 55, 5, 228,…
$ V26  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V27  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V28  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V29  <int> 139, 84, 0, 116, 48, 8, 184, 73, 0, 237,…
$ V30  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V31  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V32  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V33  <int> 153, 78, 0, 142, -7, 0, 179, 97, 38, 144…
$ V34  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V35  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V36  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V37  <int> 123, 49, 10, 166, 56, 0, 109, 72, 43, 16…
$ V38  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V39  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V40  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V41  <int> 108, 42, 23, 127, 62, 0, 125, 16, 18, 19…
$ V42  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V43  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V44  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V45  <int> 119, 48, 3, 117, 47, 3, 118, 46, 10, 233…
$ V46  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V47  <chr> " ", " ", " ", " ", " ", " ", " ", " ", …
$ V48  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V49  <int> 112, 56, 0, 127, 33, 5, 143, 72, 0, 178,…
$ V50  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V51  <chr> " ", " ", " ", " ", " ", " ", " ", " ", …
$ V52  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V53  <int> 126, 51, 5, 159, 67, 0, 149, 70, 18, 179…
$ V54  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V55  <chr> " ", " ", " ", " ", " ", " ", " ", " ", …
$ V56  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V57  <int> 112, 36, 0, 143, 84, 0, 155, 76, 0, 137,…
$ V58  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V59  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V60  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V61  <int> 115, 44, 0, 114, 11, 64, 118, 52, 53, 17…
$ V62  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V63  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V64  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V65  <int> 133, 39, 0, 65, 41, 3, 141, 34, 13, 209,…
$ V66  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V67  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V68  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V69  <int> 134, 40, 0, 113, 18, 99, 152, 67, 0, 192…
$ V70  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V71  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V72  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V73  <int> 126, 58, 0, 125, 50, 36, 118, 51, 8, 204…
$ V74  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V75  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V76  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V77  <int> 104, 15, 8, 129, 22, 8, 122, 29, 3, 189,…
$ V78  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V79  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V80  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V81  <int> 143, 33, 0, 147, 28, 0, 156, -11, 3, 145…
$ V82  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V83  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V84  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V85  <int> 141, 51, 18, 161, 74, 0, 155, 24, 0, 188…
$ V86  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V87  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V88  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V89  <int> 134, 74, 0, 168, 94, 0, 128, 82, 150, 15…
$ V90  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V91  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V92  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V93  <int> 117, 39, 0, 178, 73, 8, 104, 85, 66, 168…
$ V94  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V95  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V96  <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V97  <int> 142, 66, 0, 161, 88, 36, 123, 49, 69, 11…
$ V98  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V99  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V100 <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V101 <int> 158, 78, 0, 145, 50, 25, 120, -10, 0, 14…
$ V102 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V103 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V104 <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V105 <int> 149, 36, 13, 142, 48, 30, 145, -6, 0, 21…
$ V106 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V107 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V108 <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V109 <int> 133, 61, 3, 137, 54, 56, 153, 39, 0, 241…
$ V110 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V111 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V112 <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V113 <int> 143, 46, 0, 150, 78, 5, 175, 69, 5, 221,…
$ V114 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V115 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V116 <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V117 <int> 150, 42, 25, 120, 47, 69, 150, 45, 0, 13…
$ V118 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V119 <chr> " ", " ", " ", " ", " ", " ", " ", " ", …
$ V120 <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V121 <int> 145, 63, 0, 114, 18, 3, 178, 23, 0, 161,…
$ V122 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V123 <chr> " ", " ", " ", " ", " ", " ", " ", " ", …
$ V124 <chr> "a", "a", "a", "a", "a", "a", "a", "a", …
$ V125 <int> 115, 39, 3, 129, 39, 20, -9999, -9999, -…
$ V126 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V127 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ V128 <chr> "a", "a", "a", "a", "a", "a", " ", " ", …
  • What are the variables?
  • What are the steps needed to wrangle it into tidy form?

Why do it?

Illustrations from Julia Lowndes and Allison Horst

Tidy data is the starting point for statistical analysis and data visualisation.


Read more from tidy paper and wrangling paper.

Tidy data = statistical data



\[ X = \left[ \begin{array}{cccc} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{array} \right] \]

Variables \(x_1, x_2, \dots, x_p\) are in the columns, and the \(n\) observations are in the rows.


Graphics built on tidy data fit nicely with your statistical analysis too.

Grammatical descriptions for plots

Grammar

A grammar of graphics maps the variables from a tidy data set to elements of the plot.

It’s like having the DNA rather than a species name, so you know how the plots are related to each other.

The same script can be applied to different data.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <SCALE> +
  <THEME>
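library(patchwork)  # combines plots with + and plot_layout()
# Total count per year, across sex and age
tb_yr <- tb_tidy |>
  group_by(year) |>
  summarise(count = sum(count, na.rm = TRUE))
# Same data and mapping, two different geoms
gg1 <- ggplot(tb_yr, aes(x = year, y = count)) +
  geom_col() +
  ylim(c(0, 350))
gg2 <- ggplot(tb_yr, aes(x = year, y = count)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  ylim(c(0, 350))
gg1 + gg2 + plot_layout(ncol = 1)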
* Collapse data to get yearly totals
collapse (sum) count, by(year)

* Generate column plot
graph bar (asis) count, over(year) ///
    title("TB Cases by Year") ///
    ytitle("Count") ///
    name(g1, replace)

* Generate scatter plot with smoothed line
twoway (scatter count year) ///
       (lowess count year), ///
    title("TB Cases by Year") ///
    ytitle("Count") ///
    name(g2, replace)

* Combine the two graphs vertically
graph combine g1 g2, col(1) ysize(10)
  

What would be the grammar?

answer
x = Democrat
y = Margin
geom = boxplot


answer
x = year
y = count
colour = age
geom = smooth (method = lm)

What plot does this produce?

MAPPING: x=year, y=prop, colour=country
FACET: age
GEOM: point, smooth (method = lm)
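
One way to write this description in ggplot2 is sketched here. The data frame `tb_prop`, its country names, and its values are invented purely so the sketch runs; only the mapping, facet and geoms come from the description above.

```r
library(ggplot2)

# Hypothetical data: the values of prop are invented to make the sketch run
tb_prop <- expand.grid(
  year = 2000:2005,
  country = c("A", "B"),
  age = c("15-24", "25-34")
)
tb_prop$prop <- seq(0.2, 0.8, length.out = nrow(tb_prop))

p <- ggplot(tb_prop, aes(x = year, y = prop, colour = country)) +  # MAPPING
  geom_point() +                              # GEOM: point
  geom_smooth(method = "lm", se = FALSE) +    # GEOM: a fitted lm line
  facet_wrap(~age)                            # FACET: one panel per age group
p
```

The result is one panel per age group, each showing points and a straight-line fit per country, with country distinguished by colour.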

Make the data do the work for your visualisation

# A tibble: 10 × 4
    year age       m     f
   <dbl> <fct> <dbl> <dbl>
 1  1997 0-14      1     0
 2  1997 15-24     8    10
 3  1997 25-34    24    15
 4  1997 35-44    18     9
 5  1997 45-54    13     5
 6  1997 55-64    17    10
 7  1997 > 65     28    12
 8  1998 0-14      0     2
 9  1998 15-24    11    19
10  1998 25-34    22    24
tb_bad |> 
  ggplot() + 
    geom_point(aes(x=year, y=m), colour = "#A39000") +
    geom_point(aes(x=year, y=f), colour = "#93B3FE")
* Create the scatter plot
twoway (scatter m year, mcolor("#A39000") msymbol(O)) ///
       (scatter f year, mcolor("#93B3FE") msymbol(O)), ///
       legend(order(1 "Male" 2 "Female")) ///
       title("TB Cases by Year and Gender") ///
       xtitle("Year") ytitle("Number of Cases")

# A tibble: 10 × 4
    year sex   age   count
   <dbl> <chr> <fct> <dbl>
 1  1997 m     0-14      1
 2  1997 m     15-24     8
 3  1997 m     25-34    24
 4  1997 m     35-44    18
 5  1997 m     45-54    13
 6  1997 m     55-64    17
 7  1997 m     > 65     28
 8  1997 f     0-14      0
 9  1997 f     15-24    10
10  1997 f     25-34    15
tb_tidy |> 
  ggplot() + 
    geom_point(aes(x=year, y=count, colour=sex))

Stata doesn’t really do mappings nicely

* Encode the sex variable if it's not already numeric
encode sex, generate(sex_num)

* Create a custom color scheme
colorpalette tableau, nograph
local colors `r(p)'

* Create the scatter plot
twoway (scatter count year if sex == "m", mcolor("`r(p1)'") msymbol(O)) ///
       (scatter count year if sex == "f", mcolor("`r(p2)'") msymbol(O)), ///
       legend(order(1 "Male" 2 "Female")) ///
       title("TB Cases by Year and Sex") ///
       xtitle("Year") ytitle("Number of Cases")

Cognitive perception principles

Hierarchy of mappings



Cleveland and McGill (1984)



Illustrations made by Emi Tanaka

Hierarchy of mappings

Ranked by how accurately readers could read the numerical values back from the plot.

  1. Position - common scale (BEST)
  2. Position - nonaligned scale
  3. Length, direction, angle
  4. Area
  5. Volume, curvature
  6. Shading, color (WORST)

Primary mapping used in common plots

  1. scatterplot, barchart
  2. side-by-side boxplot, stacked barchart
  3. piechart, rose plot, gauge plot, donut, wind direction map, starplot
  4. treemap, bubble chart, mosaicplot
  5. Chernoff face
  6. choropleth map

Proximity

Place elements that you want to compare close to each other. If there are multiple comparisons to make, you need to decide which one is most important.

Change blindness (1/2)

Making comparisons across plots requires the eye to jump from one focal point to another. It may result in not noticing differences.


Change blindness (2/2)


Help the reader remember what the pattern is in the other panels by under-plotting all of the data in each panel.
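
Under-plotting can be sketched in ggplot2 as follows. The data frame `d` and its values are hypothetical. The technique relies on a layer whose data lacks the facetting variable, which ggplot2 then draws in every panel.

```r
library(ggplot2)

# Hypothetical data: three groups, values invented for illustration
set.seed(1)
d <- data.frame(
  x = rep(1:10, times = 3),
  group = rep(c("a", "b", "c"), each = 10)
)
d$y <- d$x * rep(c(1, 2, 3), each = 10) + rnorm(30)

# Copy of the data without the facetting variable, so it appears in every panel
d_all <- d[, c("x", "y")]

p <- ggplot(d, aes(x = x, y = y)) +
  geom_point(data = d_all, colour = "grey80") +  # under-plot: all groups, in grey
  geom_point(colour = "black") +                 # over-plot: this panel's group
  facet_wrap(~group)
p
```

The grey layer is added first so that each panel's own points are drawn on top of it.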

Too many colours, too busy

Pre-attentive

Can you find the odd one out?

Is it easier now?

Colour palettes should match variable type

There are three basic choices of palettes:

  • qualitative
  • sequential
  • diverging
  • (rainbow)
  • (palindrome)

Which one you choose depends on the

  • data values
  • and what to emphasize

Resources for exploring color:

palindrome: for confidence intervals, symmetric values

Example from the fable package. See unfinished palette work here.

rainbow palettes (1/2)

Jet rainbow palette

Produces false detail, banding and ambiguity for colour-blind readers.

viridis palettes

Have a perceptually uniform scale that matches the grey-scale ladder. The turbo palette alleviates the Jet rainbow palette problems.

rainbow palettes (2/2)

Jet rainbow palette

Produces false detail, banding and ambiguity.

viridis palettes

Colors still readable and following scale.

Transforming colour scales

If the variable mapped to colour has a right-skewed distribution, consider transforming it using a log or a square root.


This is the same data, where count has been transformed using square root.
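
A minimal sketch of the idea, using hypothetical right-skewed counts on a small grid (the data frame `grid` and its values are invented). Here the square root is applied inside the mapping.

```r
library(ggplot2)

# Hypothetical right-skewed counts on a 10 x 10 grid (values invented)
set.seed(2)
grid <- expand.grid(x = 1:10, y = 1:10)
grid$count <- round(rexp(nrow(grid), rate = 1 / 50))

# Raw counts: a few large values use up most of the colour range
p_raw <- ggplot(grid, aes(x = x, y = y, fill = count)) +
  geom_tile() +
  scale_fill_viridis_c()

# Same data, with count transformed using a square root before mapping
p_sqrt <- ggplot(grid, aes(x = x, y = y, fill = sqrt(count))) +
  geom_tile() +
  scale_fill_viridis_c()
```

An alternative is to leave the data alone and set the transformation on the scale itself. In ggplot2 this argument is named `trans` or `transform` depending on the installed version, so check the documentation for yours.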

Order categorical variables by the statistic

❌ Default: alphabetical

Full scale of number

✅ Order by statistic
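
Ordering by the statistic can be sketched with base R's `reorder()`. The data frame `scores`, its countries, and its values are hypothetical and are not the PISA data.

```r
library(ggplot2)

# Hypothetical scores for five countries (values invented for illustration)
scores <- data.frame(
  country = c("Australia", "Brazil", "Canada", "Denmark", "Estonia"),
  score = c(12, -3, 8, 20, 5)
)

# Default: a character variable is plotted in alphabetical order
p_alpha <- ggplot(scores, aes(x = score, y = country)) +
  geom_point()

# Ordered by the statistic: reorder() sorts the factor levels by score
scores$country_ord <- reorder(scores$country, scores$score)
p_ord <- ggplot(scores, aes(x = score, y = country_ord)) +
  geom_point()
```

With the levels sorted by score, the reader can see the ranking directly instead of reconstructing it from an alphabetical list.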

Read more about OECD PISA

Do the calculation for the reader

Famous example: trade between England and the East Indies in the 18th century

Where is the biggest difference?




Read more at the History of Data Visualisation.

Do the calculation for the reader

  • Before and after treatment weight for anorexia patients
  • Three different treatments

  • Compute the difference
  • Compare difference relative to before weight
  • Before weight is used as the baseline
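
The calculation can be sketched as follows, assuming the data resemble the `anorexia` data in the MASS package (columns `Treat`, `Prewt` and `Postwt`). The names `anx` and `pct_change` are chosen here for illustration.

```r
library(MASS)  # provides the anorexia data: Treat, Prewt, Postwt

anx <- anorexia

# Percentage change in weight, using the before weight as the baseline
anx$pct_change <- 100 * (anx$Postwt - anx$Prewt) / anx$Prewt

# Summarise the computed difference for each of the three treatments
change_by_treat <- aggregate(pct_change ~ Treat, data = anx, FUN = median)
change_by_treat
```

Plotting `pct_change` against `Treat` then shows the comparison directly, so the reader does not have to subtract two values by eye.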

Challenge 1

Let’s play a game!

Which plot wore it better?


For the question

Which country is managing TB best?

Challenge 2

Take the following plot, and make it more difficult to read.

Think about what it is you learn from the plot, and how

  • changing the mapping,
  • using colour, or
  • the geom type

might change what you learn.

  1. What is the main message?
  2. What does the underlying data look like?
  3. How are the variables mapped?
  4. What would be alternative mappings?
  5. What other geoms might be used?

What changes would you make for it to be easier to read?

Polishing your plots

Styling

The BBC cookbook has good basic advice. The work of Amanda Cox has been instrumental in the NY Times data visualisations.


Many elements are important in plot design.

  • The data should pop, so that it is the pre-attentive element
  • Grids are important for lining up values with axis values
  • Mapping of data should be appropriate
  • Legends or callouts/annotations
  • Axis text: don’t repeat yourself (e.g. “%” or “000, 000” at each tick mark)
  • Aspect ratio: square, short and wide, tall and skinny
  • Colour choices and application
  • Titles for journalism but captions for science
  • Small multiples
  • Scales allow for comparison
  • Layering order
  • Information is meaningful to the intended audience

Challenges



How would you fix these plots?

Accessibility

Aspects that assist in going beyond barriers:

  • html format: screen readers manage
  • colour: checks for colour blind readers
  • alt-text: describes the visual
  • sound: plays the visual

Benefits of html format:

Scripting the check makes it part of your workflow, e.g. with the colorspace R package.

library(colorspace)
clrs <- divergingx_hcl(palette="Zissou 1", n=7)          # original palette
clrs <- deutan(divergingx_hcl(palette="Zissou 1", n=7))  # simulated deuteranope view

Some sites allow manual upload of images.

fig-alt: "Three hexagon binned plots. The plot on the left is relatively uniform in colour, and looks like a disk, and the plot on the right has a high concentration of pink hexagons in the center, and rings of green and navy blue around the outside. The middle plot is in between the two patterns."

This is a lovely example. A lot more work is needed.

Simple efforts like marking slide progression with sound.

More examples to come.

Adding interactivity

General principles

  • Additional information on-demand
  • De-clutter
  • Engage readers
  • Response needs to be fast
  • Be careful not to inflate plot file size



  • Animation, an alternative to interactivity, keeps control with the developer rather than the reader

Notice also the subsetting legend

De-tangle complexity

Connecting information between plots

Graphical User Interfaces

GUIs provide explicit control over a small range of interactions.

  • Menu: for a medium number of categories
  • Slider: numeric values or range
  • Checkbox: for a small number of categories

Animation

Size maps population, hue maps continent, saturation maps country.

Animation over time, mapped to year.

People have short memories. Help them by using persistence of objects, interpolation, shadows, fading, …

Resources

End of session 1

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.